To briefly recap, last week we used group_by() and ggplot(), and identified some common plotting issues. Which led nicely onto…
First of all, there isn't a right or a wrong way of doing last week's homework. Actually, this is usually the case with R - there are better (easier-to-read code, less likely to make mistakes, faster) and worse (the opposite of all of those things) ways of doing things, but not necessarily a right and wrong way. Your homework was about thinking outside of the box, being creative, investigating the data, and trying to come up with an elegant visualisation. Here is one option, working from the long format of the data you generated last week:
## make a new column in long_spp which is the number of threats recorded for each population
## this is the number of non-NA values in each row of the columns primary_threat, secondary_threat, and tertiary_threat:
long_spp <- long_spp %>%
  rowwise() %>%
  mutate(n.threats = length(which(!is.na(c(primary_threat,
                                           secondary_threat,
                                           tertiary_threat)))))
## load in the scales package for the pseudo_log transformation
library(scales)
##build the ggplot, setting the date to a date format
ggplot(long_spp, aes(x=as.Date(date),
y=abundance,
col=n.threats))+
##set the theme to minimal
theme_minimal() +
##position the legend at the top on the left
theme(legend.position = "top",
legend.justification="left") +
## change the automatic label for the legend to the following
## note it is done twice (once for fill and once for col) as
## we change the colour of the lines and the fill of the smooth according to the number of threats
labs(col="Number of threats", fill="Number of threats") +
## add in the lines for each population, using the group aesthetic and the interaction()
## function to specify how the data are grouped and get 1 line per population
geom_line(aes(group = interaction(population,
                                  species,
                                  n.threats)),
alpha=0.2,
size=1) +
## add in a smoothed line across all of the population time series for each number of threats
## this visualises the mean change across the populations
geom_smooth(aes(group = n.threats,
fill = n.threats), alpha=0.3, size=1) +
##change the colours to reflect the number of threats, using a gradient
scale_color_gradient(low = "#745745",
high = "#d14124") +
scale_fill_gradient(low = "#745745",
high = "#d14124") +
## transform the y axis using a pseudo_log transformation
## (as we have some 0 counts and you can't log 0)
scale_y_continuous(trans="pseudo_log") +
##change the x and y axis labels
ylab("Population abundance") +
xlab("Year")
So here we have thought a little outside the box and looked at how the number of stressors that a population is threatened by alters its trajectory through time - it seems like the more threats a population faces, the faster it declines, on average.
How you visualise your data will of course depend on exactly the questions you are interested in. Plotting as above gives the advantage of being able to directly compare between the different groups (0, 1, 2, 3 stressors) but at the expense of the species-level data.
So far you have learned to read in data, manipulate those data, and do some basic visualisation. This week we are going to take the obvious next step and consider how we run statistical analyses in R.
The first point to make here is this is going to be a brief introduction to some useful statistical methods. R was originally developed as a statistical programming language (i.e. not primarily used for data visualisation and manipulation) and consequently any analysis you can think of you can probably do in R (including implementing neural networks, MCMC bayesian analyses, social network analysis, etc, etc).
We are going to focus on a single suite of modelling methods in-depth, and analyse the species and threats data you plotted for homework.
Generalized Linear Models (GLMs) are a critical and hugely useful technique for data analysis. They are Generalised Linear Models because they allow you to specify the error distribution of your model (e.g. gaussian, poisson, etc) and the link function (e.g. identity, log, etc). This means that they are hugely flexible: you can fit them to count data, binary data, or data where the predictor variables are categorical or continuous. Critically, GLMs can be fit to data where the errors are normally or non-normally distributed. Linear regressions - which I am sure you are all familiar with - are a special case of the GLM, where the data have a gaussian (normal) distribution with an identity link (see below).
As we said above, GLMs are fabulously flexible - they can do ANOVA-style analyses with categorical predictors, linear regressions with continuous predictor variables, and mix both categorical and continuous predictors in multiple regression analyses. We will also consider the extension to GLMs where we can account for non-independence between samples (so-called mixed models). GLMs are complex beasts, but they are widely used to analyse biological data, which are often nested and non-normal, so we are going to cover this topic in-depth here rather than spread ourselves thinly across many different statistical techniques. We will touch on some other statistics in upcoming lectures.
The following two sections are a quick refresher on error distributions and link functions. Feel free to skip these if you are comfortable with the topics, or come back to them for homework if you aren’t as familiar with the content.
An error distribution specifies the structure of the error in the model - i.e. the variance in the data not explained by the predictor variables. Take the example of a gaussian (normal) distribution - we expect the residuals of the model to be normally distributed around the fitted model:
Here we have a linear model (the blue horizontal line in the Fitted model panel) where there is no effect of a change in the x value on the y values (i.e. the slope is ~0). The data are plotted in light blue, and the residual error (the difference between the predicted values - i.e. the blue horizontal line - and the observed values - i.e. the light blue dots) is shown by the grey lines. You can see that the residuals are normally distributed around that fitted blue line (see the histogram of residuals). Remember, we are trying to minimise the residual error, as less error = a model which fits the data better. So we are looking for a roughly normal distribution of errors in our histogram of the residuals (☑), with a roughly straight, pattern-free trend in our fitted vs residuals plot (☑).
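If it helps to make this concrete, here is a minimal sketch (using simulated data, so the exact numbers are illustrative) of the situation described above - a flat relationship whose residuals are normally distributed:

```r
## simulate data where y has no relationship with x
set.seed(42)
sim <- data.frame(x = 1:100,
                  y = rnorm(100, mean = 10, sd = 2))

## fit a linear model - the slope should be ~0
flat_mod <- lm(y ~ x, data = sim)
coef(flat_mod)

## the residuals should be roughly normal and centred on 0
sim$resid <- resid(flat_mod)
hist(sim$resid)
```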
If we now fit a gaussian model to count data:
You will see that the residuals are distinctly non-normal. This is because we are working with counts, which are (1) integer values (whilst the normal distribution can produce any real value between negative infinity and positive infinity), and (2) bounded at 0.
Error distributions allow us to deal with these non-normally distributed data - in the above example we would probably use a poisson distribution (which is specifically designed to cope with positive integer values).
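To sketch what that looks like in practice, here is a poisson GLM fitted to simulated count data (the data and numbers are made up purely for illustration):

```r
## simulate counts whose expected value increases with x
set.seed(1)
counts <- data.frame(x = 1:50)
counts$y <- rpois(50, lambda = exp(1 + 0.03 * counts$x))

## poisson GLMs use a log link by default, so the coefficients
## are on the log scale
pois_mod <- glm(y ~ x, data = counts, family = "poisson")
coef(pois_mod)
```

Note that family = "poisson" quietly sets a log link, which is why the coefficients need back-transforming with exp() to interpret them on the count scale.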
For more information on the error families for glm’s have a look at the help - ?family.
Link functions are - as the name suggests - the link between the data and the error distribution. The easiest way to think of them is a transformation applied to the data before it is passed to the model. The link function relates the mean value of y to the linear predictor (x).
Again, taking the example of the normal distribution - the default link function is an “identity” link. This is basically saying that we don’t need to do any transformation to the data - we are assuming that its error follows the normal distribution.
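If you want to see the link functions themselves, base R exposes them via make.link() - a small illustrative sketch:

```r
## make.link() returns the transformation (linkfun) and its inverse (linkinv)
log_link <- make.link("log")

## linkfun maps the mean of y onto the scale of the linear predictor...
log_link$linkfun(10)       # log(10)

## ...and linkinv maps it back
log_link$linkinv(log(10))  # 10

## the identity link changes nothing
identity_link <- make.link("identity")
identity_link$linkfun(10)  # 10
```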
If we look at the histogram then the distribution is roughly normal (if we squint a lot), but now our fitted vs residuals plot shows a clear banana curve in the plotted points - we want to avoid this.
If we then fit this model with a log link:
You can see that it hasn’t changed our histogram of residuals much, but has changed the trend in the fitted vs residuals plot, which now looks good.
Note - looking for a normal distribution of residuals is useful when the residuals for a distribution are EXPECTED to be normally distributed. This means that residual plots like those above are useful for some error distribution families, but not all.
This is quite a complex topic, and one which can be hard to get your head around. If you aren’t comfortable with these issues then I suggest you do some reading to familiarise yourself with them. Here are a few places to get started:
So, let’s think about fitting a GLM to data. We will work first of all on a single population time series from the species data you plotted for homework last week and we are going to see whether there has been a significant increase, decrease, or no change in that population over time, and if there is a change what the rate of that change is. We will go through the process of fitting a glm to the data, checking the fit and other potential models, and then think about plotting the outputs of the model.
Task
Ok, so first off let’s pull out the data. Load in the species data you used last week from the github repository (using vroom and the url to the data), and then reshape your data into long format. However, we aren't interested in any values which have missing data, so we want to drop NA values during the pivot. You will need to modify your code to do this.
First off, let’s take a quick look at this big data set to refresh our memory on the structure.
Task
Use the print() function to look at the data.
If we look at this print out we can see that the date column isn’t specified in the date format, so let’s set that to a date first.
Task
Set the date column to date format.
So, the important bits for us (remember we want to know if there has been a significant change in the population over time) are going to be the date, abundance, and species columns for the moment.
As we said, we are going to use a single time series, so let’s extract the time series - using filter() for Trichocolea tomentella from the species data we loaded in.
Task
Filter the data to make a new data frame called single_spp containing only data on Trichocolea tomentella
And, as with all data analysis, we should visualise the data before we do anything else. We will do this using ggplot() and fit a loess smoothing line so we can visualise any trends in the data:
##make the plot
p1 <- ggplot(single_spp, aes(x=date, y=abundance)) +
geom_point() +
geom_line() +
theme_bw() +
ylab("Abundance") +
xlab("Year")
##add the loess smoothing:
p1 + geom_smooth(method="loess")
Ok, so I think it's pretty clear that the population is declining! But we want to know how fast that population is changing - so we should fit a model to these data.
Running a glm() in R is easy - it's a base function (so no packages need installing).
One thing to consider is the x-axis on the plot above - R will accept date as a predictor variable (make sure it is set as a date!) but it is often better to replace the date with a numeric vector as it is easier to interpret (see below in the Model summaries and outputs section) - we are going to do this with the code below:
## calculate a new column (`standardised_time`) which is the difference between the
## starting date of the time series and each other date in weeks (see ?difftime)
## we will set this to a numeric vector
single_spp <- single_spp %>%
mutate(standardised_time = as.numeric(difftime(as.Date(date),
min(as.Date(date)),
units = "weeks")))
print(single_spp[,c("abundance", "date", "standardised_time")], n = 30)
## # A tibble: 18 x 3
## abundance date standardised_time
## <dbl> <date> <dbl>
## 1 245 2003-01-01 209.
## 2 231 2004-01-01 261.
## 3 125 2005-01-01 313.
## 4 121 2006-01-01 365.
## 5 125 2007-01-01 417.
## 6 87 2008-01-01 470.
## 7 79 2009-01-01 522.
## 8 97 2010-01-01 574
## 9 42 2011-01-01 626.
## 10 35 2012-01-01 678.
## 11 41 2013-01-01 731.
## 12 49 2014-01-01 783.
## 13 55 2015-01-01 835.
## 14 40 2016-01-01 887
## 15 336 2000-01-01 52.1
## 16 212 2001-01-01 104.
## 17 204 2002-01-01 157.
## 18 368 1999-01-01 0
Now to fit our glm:
##fit a glm()
mod1 <- glm(abundance ~ standardised_time,
data = single_spp,
family = "gaussian")
Here we specify the formula of the model using ‘~’, where the first term is the dependent (y) variable, and the following term/terms is/are the independent variables. So "abundance" is our dependent variable, and standardised_time is our independent variable.
Hint - you can read ~ in R as “as a function of”. So in this case, we are testing whether the abundance of the species is a function of standardised time.
We have also specified the "family" to be "gaussian" (actually we didn't need to do this as it is the default setting, but it's good practice to make sure we understand exactly what we are doing). Family here means the error structure: we are assuming (initially) that our residuals are normally distributed, which means that - given our current settings - we are actually just fitting a linear regression to the data.
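We can check this claim directly - a gaussian glm() with an identity link returns exactly the same coefficients as lm() (toy simulated data here, just for illustration):

```r
## simulate a simple linear relationship
set.seed(7)
d <- data.frame(x = 1:30)
d$y <- 5 - 0.2 * d$x + rnorm(30)

## fit the same model both ways
glm_fit <- glm(y ~ x, data = d, family = "gaussian")
lm_fit  <- lm(y ~ x, data = d)

## the coefficients are identical
all.equal(coef(glm_fit), coef(lm_fit))
## [1] TRUE
```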
Statistics is as much art as science, and assessing how well a model fits data is complex and time consuming. There isn’t an automated or best approach for doing it.
Remember:
As an initial pass we can produce plots as I showed you above to visualise how the model fits the standardised_time, the residuals, and a few other tools for assessing how well the model fits.
To do this we need to extract some values from the model, specifically the predicted y values for each x value, and the residuals. We can do this using the predict() and resid() functions:
##return the predicted (response) values from the model and add them to the single species tibble:
single_spp$pred_gaussian <- predict(mod1,
type="response")
##return the model residuals and add to the single species tibble:
single_spp$resid_gaussian <- resid(mod1)
Now we’ve added these data to our original data frame we can start to plot some of the plots in the style of the Error distributions section above.
First off let’s plot the data again, and add in the predicted values from the model as a line:
## plot the abundances through time
p2 <- ggplot(single_spp, aes(x = standardised_time,
y = abundance)) +
geom_point() +
geom_line() +
theme_bw() +
ylab("Abundance") +
xlab("standardised_time")
##add in a line of the predicted values from the model
p2 <- p2 + geom_line(aes(x = standardised_time,
y = pred_gaussian),
col = "dodgerblue",
size = 1)
## we can also add in vertical blue lines which show the residual error of the model
## (how far the observed points are from the predicted values).
## in geom_segment we specify where we want the start and end of the segments (lines)
## to be. Without any prompting ggplot assumes that we want the start of the lines to
## be taken from the x and y values we are plotting using the ggplot() function
## i.e. standardised_time and abundance, so we just need to specify the end points of
## the lines:
p2 <- p2 +
geom_segment(aes(xend = standardised_time,
yend = pred_gaussian),
col="lightblue")
## add a title
p2 <- p2 + ggtitle("Fitted model (gaussian with identity link)")
##print the plot
p2
Eye-balling this it looks like a pretty good fit, which is nice.
Next, let’s check the residuals around the fitted values:
##plot a histogram of the residuals from the model using geom_histogram()
p3 <- ggplot(single_spp, aes(x = resid_gaussian)) +
geom_histogram(fill=I("goldenrod")) +
theme_minimal() +
ggtitle("Histogram of residuals (gaussian with identity link)")
## print the plot
p3
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The residuals aren’t what you would describe as “normally distributed” which is a concern.
Let’s look at how the residuals change with the predicted values.
Task
Try to make the plot below using your own code, plotting the predicted vs residuals in the single_spp data.frame
##make the plot of predicted vs residuals
p4 <- ggplot(single_spp,
aes(x = pred_gaussian,
y = resid_gaussian)) +
geom_point() +
theme_minimal() +
xlab("Predicted values") +
ylab("residuals") +
ggtitle("Predicted vs residual (gaussian with identity link)") +
##using geom_smooth without specifying the method (see later) means geom_smooth()
##will try a smoothing function with a formula y~x and will try to use a loess smoothing
##or a GAM (generalised additive model) smoothing depending on the number of data points
geom_smooth(fill="lightblue", col="dodgerblue")
##print it
p4
Your plot definitely shows a banana shape, not the required straight line. Actually, we could probably have predicted this - if we look at the first plot we can see that at standardised_time ~ 0 and standardised_time > 750 the line consistently underpredicts the observed values, and hence we have higher residual errors - reflected in the banana-shaped curve you see above.
Let’s look at one last diagnostic plot, the Q-Q plot. If you aren’t familiar with QQ plots then find some information here. R has a helpful function to make these plots automatically:
##plot the qq plot for the residuals from the model assuming a normal distribution,
## and add the straight line the points should fall along:
qqnorm(single_spp$resid_gaussian); qqline(single_spp$resid_gaussian)
We can see that these points also deviate from the expected straight line. QQ plots can give us quite a lot of information about the discrepancies in the data when compared to the theoretical distribution (in this case the gaussian) we have assumed describes the residuals in the model.
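If you'd like to see what skewness does to a QQ plot with data where we know the answer, here is a small simulated sketch:

```r
## an exponential sample is strongly positively skewed, so its
## points bow away from the reference line in the upper tail
set.seed(3)
skewed <- rexp(200)
qqnorm(skewed); qqline(skewed)

## a normal sample, by contrast, should hug the line
normal <- rnorm(200)
qqnorm(normal); qqline(normal)
```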
Task
Have a look at this application. Explore how the different skewness and tailedness in the data affect the QQ plots, and then decide what kind of skewness/tailedness we might have in our data plotted above.
Using this application it seems our data is likely to be negatively skewed. Let’s keep that in mind for later.
So what are our conclusions from this preliminary investigation? Well, we can see that the model does an OK but not great job of fitting the data. The next thing to do, then, is to fit some alternative models and see if any of them do a better job.
The data are counts (abundance of species at each standardised_time) so one very sensible option is a glm with poisson error distribution.
## fit a glm with a poisson distribution
mod2 <- glm(abundance ~ standardised_time,
data = single_spp,
family = "poisson")
Another option, given the skewness in the data we explored above, would be a gaussian glm with a log link, which might help solve our skewness issue…
## fit a glm with a gaussian distribution with a log link
mod3 <- glm(abundance ~ standardised_time,
data = single_spp,
family = gaussian(link = "log"))
We could also try a gaussian model with an inverse link:
## we could also try a gaussian model with an inverse link
mod4 <- glm(abundance ~ standardised_time,
data = single_spp,
family = gaussian(link = "inverse"))
Ok, so now we have four potential models (mod1, mod2, mod3, mod4). We can compare the fits of these models to the data using the Akaike information criterion (AIC):
##compare the models
AIC_mods <- AIC(mod1,
mod2,
mod3,
mod4)
## rank them by AIC using the order() function
AIC_mods[order(AIC_mods$AIC),]
## df AIC
## mod3 3 175.4301
## mod4 3 183.4976
## mod1 3 191.2363
## mod2 2 210.4472
Task
Try to work out how the order() function works - and why we use it in the square brackets for AIC_mods. To do this, try running chunks of the code above in isolation.
Great. So it looks like the gaussian model with a log link (mod3) fits the data the best (it has the lowest AIC). Conveniently, none of the other models are close to (i.e. within 2 AIC of) the best fitting model, so we can concentrate on mod3 moving forward.
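A common way to present comparisons like this is as delta-AIC, the difference of each model from the best one. A small sketch using two toy models fitted to simulated data (the model names and numbers here are illustrative, not from our species data):

```r
## simulate some count data
set.seed(11)
d <- data.frame(x = 1:40)
d$y <- rpois(40, lambda = exp(2 + 0.05 * d$x))

## two candidate models
cand <- list(gaussian_identity = glm(y ~ x, data = d, family = "gaussian"),
             poisson_log       = glm(y ~ x, data = d, family = "poisson"))

## delta-AIC is 0 for the best model; models with
## delta > 2 have substantially less support
aics  <- sapply(cand, AIC)
delta <- aics - min(aics)
sort(delta)
```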
Whilst AIC tells us about the comparative fits of the different models, it doesn't tell us how well the models actually fit the data. They could all just fit really badly, and mod3 just fits least badly! So we need to go back and check the fit of the model again.
##return the predicted (response) values from the model and add them to the single species tibble:
single_spp$pred_gaussian_log <- predict(mod3,
type="response")
##return the model residuals and add to the single species tibble:
single_spp$resid_gaussian_log <- resid(mod3)
##first off let's plot the data again, and add in the predicted values from the model as a line. We can modify the plot we started earlier:
p5 <- ggplot(single_spp, aes(x=standardised_time, y=abundance)) +
geom_point() +
geom_line() +
theme_bw() +
ylab("Abundance") +
xlab("standardised_time")
##add in a line of the predicted values from the model
p5 <- p5 + geom_line(aes(x = standardised_time,
y = pred_gaussian_log),
col = "dodgerblue",
size = 1)
## we can also add in lines showing the distance of each observation from
## the value predicted by the model (i.e. these lines visualise the residual error)
p5 <- p5 + geom_segment(aes(xend = standardised_time,
yend = pred_gaussian_log),
col="lightblue")
## add a title
p5 <- p5 + ggtitle("Fitted model (gaussian with log link)")
##print the plot
p5
That looks like a good fit! But we need to check the other diagnostics we plotted earlier. There is actually a convenient short-hand way of doing this in R - we can tell it to plot() a glm() object and it will automatically produce the following:
##plot the diagnostic graphics for model 3
plot(mod3)
Briefly, for the above plots:
In all, we have a model which fits the data well and which we can be pretty confident in.
Note - For a more in-depth discussion of how to interpret the above diagnostic plots see the excellent blog post here.
One thing we should bear in mind is that the gaussian distribution is not specifically formulated for count data - it assumes that error values are going to be continuous, whereas we know that isn't the case with our data. This can be a cause for concern, and whilst strictly we shouldn't do this (as it violates the assumptions of the normal distribution), we can usually get away with it when the mean of the dependent variable is high (as a rule of thumb, > 50 - which is the case here).
So we have fit our model, and ensured that our model fits the data well. Now we want to look at what our model says about the data.
First off let’s look at the model output, using the summary() function:
##summarise the model outputs
summary(mod3)
##
## Call:
## glm(formula = abundance ~ standardised_time, family = gaussian(link = "log"),
## data = single_spp)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -58.133 -18.337 0.437 16.949 55.486
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.8867705 0.0534321 110.2 < 2e-16 ***
## standardised_time -0.0027565 0.0002377 -11.6 3.38e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 806.5102)
##
## Null deviance: 184068 on 17 degrees of freedom
## Residual deviance: 12904 on 16 degrees of freedom
## AIC: 175.43
##
## Number of Fisher Scoring iterations: 4
You can see this gives us further details on the model. The important bits for us at this stage are the coefficients.
Note - for a full discussion of the output see the excellent posts here and here.
We can see there is a significant negative effect of standardised_time on the dependent variable (abundance). It's now easy to see why we converted the date to standardised_time: as a continuous numeric predictor (number of weeks since the first recorded abundance) it is much easier to interpret. One caution, though - because we fitted the model with a log link, the coefficient is on the log scale, so abundance changes by a factor of exp(-0.0028) ≈ 0.997 per week, i.e. a decline of roughly 0.3% per week over the time series.
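To make that back-transformation explicit, here is a quick sketch using the coefficient value from the summary above:

```r
## the slope from summary(mod3), on the log scale
slope <- -0.0027565

## multiplicative change in abundance per week
exp(slope)
## [1] 0.9972473

## as a percentage change per week
(exp(slope) - 1) * 100

## compounded over a year (~52.18 weeks)
(exp(slope * 52.18) - 1) * 100
```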
We can plot the output of this model:
## first off let's plot the data again
p6 <- ggplot(single_spp, aes(x=standardised_time, y=abundance)) +
geom_point() +
geom_line() +
theme_bw() +
ylab("Abundance") +
xlab("standardised_time")
## use the geom_smooth() function to add the regression to the plot.
## unlike earlier here we are specifying the model type (glm), the formula,
## and the error structure and link
p6 <- p6 + geom_smooth(data=single_spp,
method="glm",
method.args = list(family = gaussian(link="log")),
formula = y ~ x,
col = "dodgerblue",
fill = "lightblue")
##print the plot
p6
You can see that ggplot() conveniently calculates the confidence intervals around the line, giving us a nice visualisation of the fitted model.
Generalized linear mixed-effect models (GLMMs) allow us to model data which are non-normally distributed and also include random effect structures within the data. These random effects could be differences between sites where we have collected data, data collected from different populations of the same species, etc., and mean that there might be additional structure in the data which is not explained by our fixed effects. There are two kinds of random effect - random slopes and random intercepts. In most cases we specify a random intercept - i.e. we assume that the groups differ in where the regression line fitted through the data crosses the y axis, but that they don't differ in the rate of change of that regression line (the slope).
If you aren’t comfortable with the concepts of mixed effect models then there is a fantastic visual explanation of the effects of data grouping available here which I recommend going over either before you continue or after this class - it makes GLMMs very intuitive.
GLMMs are a pain to fit well, and complex to investigate/assess. A very good place for advice is Ben Bolker's excellent post on frequently asked questions for GLMMs available here. There is another complementary post where he works through a series of GLMM examples available here.
We will do a brief investigation of GLMMs in R (again, this is a quick intro rather than an in-depth dive - there is far too much to learn in 3 hours, but this will give you the skills to fit GLMMs and learn how to learn statistics in R). For this, let's use the full species and stressors dataset we filtered for the GLM section.
Again, we will add in a standardised time column to our data, starting at 0 weeks for the earliest recorded abundance for each population of each species:
## calculate standardised time per population and per species
long_spp <- long_spp %>%
group_by(species, population) %>%
mutate(standardised_time = as.numeric(difftime(as.Date(date),
min(as.Date(date)),
units="weeks")))
## just look at four of the columns to visualise the data
## (you can look at all of them in Rstudio) but printing all the columns in
## Rmarkdown means we won't see the standardised_time column
print(long_spp[,c("species", "population", "abundance", "standardised_time")], n = 10)
## # A tibble: 696 x 4
## # Groups: species, population [59]
## species population abundance standardised_time
## <chr> <chr> <dbl> <dbl>
## 1 Schistidium helveticum pop_1 353 0
## 2 Schistidium helveticum pop_1 419 52.1
## 3 Schistidium helveticum pop_1 359 104.
## 4 Schistidium helveticum pop_1 363 156.
## 5 Schistidium helveticum pop_1 321 209.
## 6 Schistidium helveticum pop_1 280 261.
## 7 Schistidium helveticum pop_1 379 313
## 8 Paraleucobryum longifolium pop_1 24 0
## 9 Paraleucobryum longifolium pop_1 12 52.1
## 10 Paraleucobryum longifolium pop_1 18 104.
## # … with 686 more rows
Task
There are a number of ways we could create the standardised time column (e.g. with week 0 set to the earliest recorded abundance of any species and population in the data, or creating the standardised time as we have done above, where each species' population has its own week 0). Each of these has different implications and makes different assumptions about what patterns we are looking for in the data. Try to identify what the implications of choosing these different standardised times would be.
In the introduction to this week’s workbook I gave a possible solution for last week’s homework:
We will set out to see - statistically - if there is any change in the abundances of the different species over time based on the number of stressors (as in the plot above).
Task
Implement the code (given at the beginning) to calculate the number of stressors for each population of each species.
Great, so we now have a data frame with standardised times and the number of stressors added to it, let’s first visualise the data:
## we will specify the x and y aesthetics here but specify the groups for the lines later
## this is because doing so means that the data for the geom_smooth() will then be all of the
## data for each of the facets, but we can specify grouped data for the lines in geom_line()
ggplot(long_spp, aes(x = standardised_time,
y = abundance)) +
## add the lines. we can use the group aesthetic to specify
## groups within the data (in this case species and population)
## that we want to treat as separate data for the sake of plotting;
## the `interaction()` function allows us to specify multiple levels to these data
geom_line(aes(group = interaction(species, population))) +
facet_wrap(~n.threats) +
theme_minimal() +
##fit a linear regression to the data to help visualise
geom_smooth(method = "lm")
## I haven't run this, but here is an example of what happens when we group
## in the initial ggplot() function rather than in geom_line()
## try running it and see what happens
ggplot(long_spp, aes(x = standardised_time,
y = abundance,
group = interaction(species, population))) +
geom_line() +
facet_wrap(~n.threats) +
theme_minimal() +
geom_smooth(method = "lm")
So, eyeballing it, it looks like there might be some differences between the treatments (number of threats) - although if you run the second chunk of code above, where a linear model is fit to each time series, you will see that there is some variation in those rates of change through time.
Ok, let’s think about the modelling framework. We know we have the following variables in the data which we might want to consider:
- abundance - this will be our response variable
- standardised_time - we want to understand whether the number of threats affects the populations through time, so we still need to include the time element
- n.threats - the number of threats
- species - looking at the above plots it seems like the species are potentially behaving in slightly different ways
- population - for some of the species we have multiple populations, which we might want to consider separately
Great, so from this we can already start to think about a model structure based on what we are interested in. We know it's going to be something like:
abundance ~ standardised_time + n.threats
But remember we want to see whether the number of threats affects how population abundance changes through time, so we are going to need to consider an interaction between standardised_time and n.threats. We specify an interaction in R using a colon (:):
abundance ~ standardised_time + n.threats + standardised_time:n.threats
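As an aside, R also offers the shorthand `*`, which expands to the main effects plus their interaction. A quick sketch with made-up data (the variable names are just illustrative) confirming the two formulas are equivalent:

```r
## toy data: abundance declining faster with more threats
set.seed(5)
d <- data.frame(time = rep(1:10, 3),
                threats = rep(0:2, each = 10))
d$abund <- rnorm(30, mean = 100 - d$time * d$threats, sd = 5)

## the explicit and shorthand forms of the same model
long_form  <- glm(abund ~ time + threats + time:threats, data = d)
short_form <- glm(abund ~ time * threats, data = d)

## identical coefficients
all.equal(coef(long_form), coef(short_form))
## [1] TRUE
```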
Great. Now let’s consider our other variables - species and population. We could consider species (and indeed population) as fixed effects in the model - if we were interested in them as part of the main part of our question. However, what we really want to know is the effect of the number of threats on abundance - we aren't interested in the species effects per se.
This is where mixed models come to the fore - we want to look at the effects of standardised_time and n.threats whilst accounting for other variables (species & population) which might explain some variation in the data but whose effects we aren't interested in.
There are a number of packages for fitting GLMMs in R - we are going to use the glmmTMB package because it's really fast (so great for big data sets) and has good online help/tutorials.
Task
Find, install, and load the glmmTMB package.
There are some differences in how GLMMs are coded between the different packages, but generally:
- + (1 | group) fits a random intercept for each level of group
- + (0 + x | group) fits a random slope of x for each level of group
- + (x | group) fits both a random intercept and a random slope
We are going to fit a random intercept model with intercepts for each of the species in our data using glmmTMB:
##fit a random effects model
m_mod1 <- glmmTMB(abundance ~
standardised_time +
n.threats +
standardised_time:n.threats +
(1|species),
data=long_spp,
family="gaussian")
##look at the model summary:
summary(m_mod1)
## Family: gaussian ( identity )
## Formula:
## abundance ~ standardised_time + n.threats + standardised_time:n.threats +
## (1 | species)
## Data: long_spp
##
## AIC BIC logLik deviance df.resid
## 8002.8 8030.1 -3995.4 7990.8 690
##
## Random effects:
##
## Conditional model:
## Groups Name Variance Std.Dev.
## species (Intercept) 9602 97.99
## Residual 4447 66.69
## Number of obs: 696, groups: species, 51
##
## Dispersion estimate for gaussian family (sigma^2): 4.45e+03
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 255.547671 16.536993 15.453 <2e-16 ***
## standardised_time 0.038182 0.018306 2.086 0.037 *
## n.threats -69.127191 5.394216 -12.815 <2e-16 ***
## standardised_time:n.threats -0.091489 0.009579 -9.551 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
So if we look at this output we can see that the amount of the variance the random effect (species) accounts for in our model is given, as is the residual variance. Remember that we are assuming that species (the random effect) accounts for all the variance in the data not explained by our fixed effects (standardised_time and n.threats and their interaction).
We can calculate how much of the variance the random effect accounts for as the variance of the random effect divided by the sum of the random effect variance and residual variance:
Task
Calculate the variance explained by the random effect
9602 / (9602+4447)
## [1] 0.683465
So about 68%.
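Rather than copying the numbers out of the summary by hand, you can pull the variances straight from the fitted model - a sketch, assuming m_mod1 from above is in your workspace:

```r
## extract the conditional-model random effect variances from the fitted glmmTMB model
vc <- VarCorr(m_mod1)$cond
species_var  <- attr(vc$species, "stddev")^2   ## variance of the species intercepts
residual_var <- sigma(m_mod1)^2                ## residual variance
## proportion of the total variance accounted for by the random effect
unname(species_var / (species_var + residual_var))
```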
However, whilst we expect there to be differences in the intercept between the species (as coded before) there is also a good reason to think that there might be differences between populations of the same species as well. We call this a nested random effect.
Nested random effects allow us to specify that some random effects (in our case population) should be considered within a larger random effect (species) rather than separately in the model. In our case we don’t want to consider species and population separately because populations of the same species are not independent of one another - they share species-level characteristics, so the population effects only make sense within their species.
It can be hard to get your head around this, but consider, say, a data set where we have the test results of individuals in classes in different schools. We want to know whether there is an effect of gender on the test results whilst accounting for the potential random effects of school and class. Clearly, the random effect of class should be nested in school (as the school determines the environment, practices, approaches, etc. which happen in the class), but the class identity might still have an effect, as each class is taught by a different teacher.
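In formula terms that school example would look something like this (all of the names here are hypothetical, purely to illustrate the nested syntax):

```r
## HYPOTHETICAL example - school_data, test_result, gender, school,
## and class are made-up names, just to show class nested within school
school_mod <- glmmTMB(test_result ~ gender + (1 | school/class),
                      data = school_data,
                      family = "gaussian")
```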
So back to our example, we want to nest population inside species, and do this as follows:
##fit a random effects model with nested random effects
m_mod2 <- glmmTMB(abundance ~ standardised_time + n.threats + standardised_time:n.threats + (1|species/population),
data=long_spp,
family="gaussian")
##look at the model summary:
summary(m_mod2)
## Family: gaussian ( identity )
## Formula:
## abundance ~ standardised_time + n.threats + standardised_time:n.threats +
## (1 | species/population)
## Data: long_spp
##
## AIC BIC logLik deviance df.resid
## 7911.9 7943.7 -3949.0 7897.9 689
##
## Random effects:
##
## Conditional model:
## Groups Name Variance Std.Dev.
## population:species (Intercept) 8.988e+03 9.481e+01
## species (Intercept) 5.554e-93 7.452e-47
## Residual 3.755e+03 6.128e+01
## Number of obs: 696, groups: population:species, 59; species, 51
##
## Dispersion estimate for gaussian family (sigma^2): 3.76e+03
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 208.55777 18.16210 11.483 < 2e-16 ***
## standardised_time 0.07329 0.01797 4.079 4.53e-05 ***
## n.threats -33.47467 8.30603 -4.030 5.57e-05 ***
## standardised_time:n.threats -0.11393 0.00934 -12.198 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now we can see that we have a new term (population:species) in our “conditional model” and again we can check how much variance is explained by these random effects.
Task
Calculate the variance explained by these random effects
(8.988e+03+5.554e-93)/(8.988e+03+5.554e-93+3.755e+03)
## [1] 0.7053284
Not much more than just species by itself, HOWEVER we know a priori that population should nest within species (because we expect individuals of the same species to have e.g. similar reproductive rates, behaviours, mortality rates etc) and so we should include this in the model, regardless of whether it explains more of the variance (although we would expect it would).
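Incidentally, the (1|species/population) shorthand is just a compact way of writing two separate intercept terms, so the model below should fit identically to m_mod2 - a sketch, assuming the objects from above are still loaded:

```r
## the nested shorthand expands to a species intercept plus a
## species-by-population intercept
m_mod2b <- glmmTMB(abundance ~ standardised_time + n.threats +
                     standardised_time:n.threats +
                     (1 | species) + (1 | species:population),
                   data = long_spp,
                   family = "gaussian")
```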
We can also specify crossed random effects, where two or more variables can create distinct groupings:
y ~ x1 + x2 + (1|z1) + (1|z2)
Whether a random effect is nested or crossed can depend upon what scale you are looking at, so it isn’t necessarily trivial to decide whether to nest or cross your random effects (a full discussion is beyond today’s work), but there are some resources to help you get your head around it at the bottom of this workbook.
We will stick with our nested model design as there isn't any variable in the data which would suggest that we should include a crossed design:
abundance ~ standardised_time + n.threats + standardised_time:n.threats + (1|species/population)
So far we have considered what random effects should be in the model, but haven’t actually looked at the model fit! The basic approach of plotting a QQ plot that we used above doesn’t work well mathematically for GLMMs; however, there are some handy tools in the DHARMa package which can help us assess the model fit more robustly. These work by using the model fit to simulate residuals which incorporate the uncertainty in the model - see ?simulateResiduals for a good explanation.
##load DHARMa
library(DHARMa)
## simulate the residuals from the model
##setting the number of sims to 1000 (more is better, according to the DHARMa help file)
m_mod2_sim <- simulateResiduals(m_mod2, n = 1000)
##plot out the residuals
plot(m_mod2_sim)
Ok so what do these plots tell us? Well on the left is a version of the QQ plot we saw earlier, and we read it in exactly the same way as a normal QQ plot. We also get some statistical tests to see how well our data fit the distribution - if we have a significant (p<0.05) test score then our data are significantly DIFFERENT from the distribution specified in the model - you can see here that this is the case.
On the right we have a version of the residuals vs predicted plot - we are looking for solid red lines which are close to horizontal (actually close to tracking the dashed red horizontal lines at 0.25, 0.5, and 0.75 - the quartiles of the data). Again, we can see our data are a poor fit and we get a warning about the quantile deviations in the title.
Conclusion? Our model is doing a poor job of fitting to the data.
Task
Head back to the online qqplot visualisations and play with the sliders and see if you can determine what sort of skew in the data we are dealing with.
So looking at the online visualisations I would say we are looking at long tailed positively skewed data when compared to a normal distribution. Let’s see if we can solve this.
This is where understanding our data is important - why are we seeing these patterns? Well, one reason is because of this:
We have time series which go extinct, and then we have a series of 0’s recorded until the end of the time series (plotted as points in the graphic above).
Task
Consider whether we should include these 0’s in our analysis.
Well, let’s think about fitting a linear regression to the data above, where we either (1) include the zeros (red line in the plot below) or (2) don’t include them (yellow line):
You can see that - quite obviously - there is a big difference in our estimates of the rate of change of these populations over time and I hope that it should be obvious that we don’t want to include all of the 0’s in the analysis.
So do we just remove all of the 0s?
Well…no, because the first 0 (in chronological order) still gives us valid and important information on how fast the population is declining, so we should just exclude all the rest of the 0’s after that.
So how do we go about doing this?
Well, first we need to write a function to return the data we want (only one 0). Luckily our data finish with 0, 0, 0, 0 etc and don’t have something like 0, 0, 1, 0, 0 (which would make things trickier to code). This means we can tell R to return the data up to the first instance of a 0 being present:
##a function to return data up to and including the first 0 count:
single_zero <- function(x, group_info){
if(nrow(x)>0){
x <- x[ order(x$standardised_time) ,]
if( min(x$abundance) == 0){
return( x[1:min(which(x$abundance == 0)),] )
}else{return(as_tibble(x))}
}
}
Task
See if you can work out what the above function does. I suggest running it on a small subset of the full data, e.g. by using the [[2]] approach above to select out a data frame from the split_spp list. Hint - the group_info section in function(x, group_info) is necessary for the group_map() function to work, so you can ignore this.
Ok, so now that we have a function (and you have checked it works using a subset of the data) we can apply it to our list. The tidyverse provides tools to do exactly that - namely the group_map() function:
## make a new data frame
species_single0 <- long_spp %>%
##we want to cut the long_spp data up by the values in species and population
group_by(species, population) %>%
##the group_map function allows us to apply a function to these groups
##and we want to keep the grouping variables, hence keep = T
group_map(single_zero, keep = T) %>%
##and then bind it all back together again to a tibble
##otherwise it is returned as a list of the group data
bind_rows()
## Warning: The `keep` argument of `group_map()` is deprecated as of dplyr 1.0.0.
## Please use the `.keep` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
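As the warning says, newer versions of dplyr (>= 1.0.0) want the argument spelled .keep, so the same pipeline without the deprecation warning would be:

```r
## the same pipeline using the non-deprecated argument name
species_single0 <- long_spp %>%
  group_by(species, population) %>%
  group_map(single_zero, .keep = TRUE) %>%
  bind_rows()
```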
Let’s check whether this has worked by looking at the data we plotted earlier
## # A tibble: 9 x 3
## species standardised_time abundance
## <chr> <dbl> <dbl>
## 1 Acaulon triquetrum 0 189
## 2 Acaulon triquetrum 52.3 114
## 3 Acaulon triquetrum 104. 88
## 4 Acaulon triquetrum 157. 36
## 5 Acaulon triquetrum 209. 3
## 6 Acaulon triquetrum 261 1
## 7 Acaulon triquetrum 313. 1
## 8 Acaulon triquetrum 365. 1
## 9 Acaulon triquetrum 417. 0
Great! It ends in a single 0. We can continue.
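It is worth checking every population, not just the one we plotted. A quick sketch, assuming species_single0 and the tidyverse are loaded - each series should now contain at most one zero:

```r
## count the zeros in each population's series - any rows returned
## here still contain more than one zero and would need investigating
species_single0 %>%
  group_by(species, population) %>%
  summarise(n_zeros = sum(abundance == 0), .groups = "drop") %>%
  filter(n_zeros > 1)
## we expect an empty tibble
```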
So our model didn’t fit well earlier, but might fit better now we have removed the data that shouldn’t have been there in the first place! So, let’s do some exploration.
So, we could fit a GLMM with a log link (as this is what worked best in the first example on a single species); however, because we have 0 values in the data (and you can’t log a zero) we would need to add one to the data. Additionally, we are using count data, so really the gaussian distribution isn’t a good option (unless the mean of the data is high, as it was when we fit the model to the single time series, but it isn’t now we are dealing with values which decline down to 0).
So, what other options do we have? Well, count data are a very common kind of data to have, so luckily there is a raft of tools available for them. The following are some of the most commonly used count data distributions which can be fit in glmmTMB (see here).
There are also truncated versions of many of the above, but if you plot a histogram of our data it doesn’t look like our data are truncated:
This is a fairly baffling array of options, but there are some tools we can use to help us assess these data.
Note - all of the below distribution identification techniques work well (often better) for GLMs.
The fitdistrplus library allows us to compare our data to distributions fit to it:
##load the fitdistrplus package
library(fitdistrplus)
##fit a poisson distribution to our data:
fit_pois <- fitdist(species_single0$abundance,
distr = "pois")
This throws up some warnings. After some digging these are because we have some zero frequencies in our data (we can ignore this).
##plot the data
plot(fit_pois)
##look at the summary statistics
gofstat(fit_pois)
## Chi-squared statistic: Inf
## Degree of freedom of the Chi-squared distribution: 19
## Chi-squared p-value: 0
## the p-value may be wrong with some theoretical counts < 5
## Chi-squared table:
## obscounts theocounts
## <= 1 2.800000e+01 6.081900e-57
## <= 5 3.300000e+01 2.049747e-50
## <= 11 3.100000e+01 5.019744e-43
## <= 19 3.100000e+01 2.708704e-35
## <= 30 2.700000e+01 5.877842e-27
## <= 41 2.900000e+01 2.226201e-20
## <= 55 2.700000e+01 8.179528e-14
## <= 68 2.800000e+01 4.194862e-09
## <= 81 2.700000e+01 1.849605e-05
## <= 92 2.800000e+01 4.473910e-03
## <= 102 2.700000e+01 2.142684e-01
## <= 121 3.000000e+01 2.940196e+01
## <= 145 3.200000e+01 3.654994e+02
## <= 162 2.900000e+01 1.832680e+02
## <= 187 2.700000e+01 2.156040e+01
## <= 222 2.800000e+01 5.147652e-02
## <= 249 2.700000e+01 6.213563e-08
## <= 293 2.700000e+01 6.661338e-14
## <= 356 2.700000e+01 0.000000e+00
## <= 436 2.700000e+01 0.000000e+00
## > 436 3.000000e+01 0.000000e+00
##
## Goodness-of-fit criteria
## 1-mle-pois
## Akaike's Information Criterion 82929.96
## Bayesian Information Criterion 82934.35
In the first plot (on the left) the red density shows the EXPECTED distribution of the data if they were from a poisson distribution, whilst the actual data are shown by the thin black bars - what we are looking for is the bars following the same distribution as the red area. You can instantly see that they aren’t the same, and this is backed up by the right hand plot (the cumulative density function) where the black and red lines should mimic each other.
This is further backed up by the p-value from the Goodness-of-fit test for the poisson distribution - a significant p-value here denotes a significant difference from the distribution we are testing the data against (poisson).
Conclusion - the poisson distribution is a bad fit for our data.
What about the negative binomial distribution? This is a great (and more flexible) option for fitting to count data:
##fit a nbinom to the data instead
fit_nbinom <- fitdist(species_single0$abundance,
dist = 'nbinom')
##again we get warnings from those missing values and can ignore
##plot it out:
plot(fit_nbinom)
That looks a lot better! We are still underpredicting throughout the distribution, but the CDF looks good.
##the goodness of fit
gofstat(fit_nbinom)
## Chi-squared statistic: 31.67363
## Degree of freedom of the Chi-squared distribution: 18
## Chi-squared p-value: 0.02402514
## Chi-squared table:
## obscounts theocounts
## <= 1 28.00000 19.97953
## <= 5 33.00000 26.85885
## <= 11 31.00000 31.87077
## <= 19 31.00000 35.46763
## <= 30 27.00000 41.34348
## <= 41 29.00000 35.67087
## <= 55 27.00000 39.52625
## <= 68 28.00000 32.15955
## <= 81 27.00000 28.64267
## <= 92 28.00000 21.91913
## <= 102 27.00000 18.32195
## <= 121 30.00000 31.17071
## <= 145 32.00000 33.64150
## <= 162 29.00000 20.60556
## <= 187 27.00000 26.26787
## <= 222 28.00000 30.16536
## <= 249 27.00000 19.02378
## <= 293 27.00000 24.82173
## <= 356 27.00000 25.61594
## <= 436 27.00000 21.19383
## > 436 30.00000 35.73303
##
## Goodness-of-fit criteria
## 1-mle-nbinom
## Akaike's Information Criterion 7114.463
## Bayesian Information Criterion 7123.257
But we still have a significant difference from the nbinom distribution (p<0.05). However, remember that above we are looking at the distribution of the WHOLE data, and we need to ensure that the residuals fit the model assumptions. So the above is SUGGESTING that the poisson is unlikely to be a good fit, and that the negative binomial might be better. Better still, we might consider a zero inflated negative binomial - which can account for a larger number of 0s than expected given the negative binomial distribution. So we might like to consider the poisson, the negative binomial, and their zero inflated versions for our model.
Let’s start by fitting the poisson model. We are expecting this to fit the data poorly, but we can try it anyway. We are going to use scaled predictor variables.
Why standardised data?
Just as a quick aside - we often scale() data (so the data have a mean of zero (“centering”) and a standard deviation of one (“scaling”)) when the continuous predictor variables we are using are on very different scales. In our case we have one (number of threats) which is between 0 and 3 and one (standardised time) which is between 0 and 991. This can cause issues with the model fitting, as the coefficients for one of your predictors might be vanishingly small (and therefore hard to estimate) when compared to your other predictors.
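If you haven't met scale() before, here is a tiny self-contained example of what it does:

```r
## scale() centres a variable to mean 0 and divides by its standard deviation
x <- c(0, 200, 400, 600, 800, 991)
scaled_x <- scale(x)
mean(scaled_x)   ## effectively 0 (up to floating point error)
sd(scaled_x)     ## 1
```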
So here we fit our model, with scaled parameters, and population nested within species as our random effects:
## fit a poisson model
ps_mod <- glmmTMB(abundance ~ scale(standardised_time) * scale(n.threats) + (1 | species/population),
data = species_single0,
family="poisson")
##summarise the model output
summary(ps_mod)
## Family: poisson ( log )
## Formula:
## abundance ~ scale(standardised_time) * scale(n.threats) + (1 |
## species/population)
## Data: species_single0
##
## AIC BIC logLik deviance df.resid
## 13277.4 13303.8 -6632.7 13265.4 594
##
## Random effects:
##
## Conditional model:
## Groups Name Variance Std.Dev.
## population:species (Intercept) 1.409e+00 1.1870115
## species (Intercept) 4.827e-07 0.0006948
## Number of obs: 600, groups: population:species, 59; species, 51
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.158693 0.156990 26.49 < 2e-16
## scale(standardised_time) -0.547610 0.006039 -90.68 < 2e-16
## scale(n.threats) -0.995613 0.150798 -6.60 4.05e-11
## scale(standardised_time):scale(n.threats) -0.558804 0.005773 -96.80 < 2e-16
##
## (Intercept) ***
## scale(standardised_time) ***
## scale(n.threats) ***
## scale(standardised_time):scale(n.threats) ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
If we want to look at how well a linear regression fits the data we tend to use r-squared as a goodness of fit measure for our model (an r2 of 1 means the model exactly predicts the data). This measure falls apart when we consider GLMs, and even more so when we consider GLMMs, but we can calculate a simple pseudo r2 value to help us roughly estimate the goodness of fit. There are a number of simple and complex ways to do this, but we will just implement the following:
##function to calculate a pseudo R2 value for our GLMM
r2.corr.mer <- function(m) {
lmfit <- lm(model.response(model.frame(m)) ~ fitted(m))
summary(lmfit)$r.squared
}
## apply it to our poisson model
r2.corr.mer(ps_mod)
## [1] 0.8672486
0.867 is a very good r2. There are also more complex methods, e.g. that available in the MuMIn package:
##r squared calculated by MuMIn:
MuMIn::r.squaredGLMM(ps_mod)
## R2m R2c
## delta 0.5182369 0.9981549
## lognormal 0.5182394 0.9981599
## trigamma 0.5182343 0.9981499
Which is lower.
We can also use a simple function to look at our overdispersion in the model:
##function to calculate overdispersion
overdisp_fun <- function(model) {
rdf <- df.residual(model)
rp <- residuals(model,type="pearson")
Pearson.chisq <- sum(rp^2)
prat <- Pearson.chisq/rdf
pval <- pchisq(Pearson.chisq, df=rdf, lower.tail=FALSE)
c(chisq=Pearson.chisq,ratio=prat,rdf=rdf,p=pval)
}
##apply to our data
overdisp_fun(ps_mod)
## chisq ratio rdf p
## 9357.20287 15.75287 594.00000 0.00000
The data are considered overdispersed when the chisq value is > the rdf value - i.e. when the ratio is > 1. Clearly these data are overdispersed!
Note - overdispersion is an issue only for distributions which don't estimate a scale parameter (i.e. poisson and binomial).
So let’s fit some other potential models to the data. We can update() the model structure so we don’t have to keep typing out the full model again:
###try with nbinom:
nb1_mod <- update(ps_mod, family="nbinom1")
##and the second parameterisation of nb
nb2_mod <- update(ps_mod, family="nbinom2")
We should also consider zero inflated versions of the NB distributions, as we know that we have 0s in our data. glmmTMB allows us to assume that all of the data are equally likely to have zero inflation (using the ziformula = ~1 argument), or alternatively to specify structures in the data which might explain our increase in zeros. In our case, if we look at our data plotted out above, we can see that (and logically this makes sense) the number of threats is likely to be a key driver of the number of zeros, as the number of threats seems to determine whether species are declining or not. In this case we can specify the number of threats as the potential driver of zeros in our data:
##zero inflated version of these with zeros being attributed to number of threats:
Zi_nb1_mod <- update(nb1_mod, ziformula = ~n.threats)
Zi_nb2_mod <- update(nb2_mod, ziformula = ~n.threats)
##and we can also look at the zero inflated version of the poisson model:
Zi_ps_mod <- update(ps_mod, ziformula = ~n.threats)
Ok, so let’s compare the fits of these models. We can do this using the anova() (analysis of variance) function, which gives us additional information on top of what AIC() returns to us:
##compare the models
anova(ps_mod,
Zi_ps_mod,
nb1_mod,
nb2_mod,
Zi_nb1_mod,
Zi_nb2_mod)
## Data: species_single0
## Models:
## ps_mod: abundance ~ scale(standardised_time) * scale(n.threats) + (1 | , zi=~0, disp=~1
## ps_mod: species/population), zi=~0, disp=~1
## nb1_mod: abundance ~ scale(standardised_time) * scale(n.threats) + (1 | , zi=~0, disp=~1
## nb1_mod: species/population), zi=~n.threats, disp=~1
## nb2_mod: abundance ~ scale(standardised_time) * scale(n.threats) + (1 | , zi=~n.threats, disp=~1
## nb2_mod: species/population), zi=~n.threats, disp=~1
## Zi_ps_mod: abundance ~ scale(standardised_time) * scale(n.threats) + (1 | , zi=~0, disp=~1
## Zi_ps_mod: species/population), zi=~0, disp=~1
## Zi_nb1_mod: abundance ~ scale(standardised_time) * scale(n.threats) + (1 | , zi=~0, disp=~1
## Zi_nb1_mod: species/population), zi=~n.threats, disp=~1
## Zi_nb2_mod: abundance ~ scale(standardised_time) * scale(n.threats) + (1 | , zi=~n.threats, disp=~1
## Zi_nb2_mod: species/population), zi=~n.threats, disp=~1
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## ps_mod 6 13277.4 13303.8 -6632.7 13265.4
## nb1_mod 7 6113.1 6143.9 -3049.6 6099.1 7166.3 1 <2e-16 ***
## nb2_mod 7 6220.4 6251.1 -3103.2 6206.4 0.0 0 1
## Zi_ps_mod 8 12477.4 12512.5 -6230.7 12461.4 0.0 1 1
## Zi_nb1_mod 9 6041.6 6081.2 -3011.8 6023.6 6437.7 1 <2e-16 ***
## Zi_nb2_mod 9 6107.2 6146.8 -3044.6 6089.2 0.0 0 1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can see the zero inflated version of the NB1 model (Zi_nb1_mod) seems to fit the best of the models tested (it has the lowest AIC), and if we look at the pseudo-r2 values we can see that it is still fitting well:
Task
Use the r2.corr.mer() function to calculate the pseudo r2 value for the Zi_nb1_mod.
##pseudo r2 value
r2.corr.mer(Zi_nb1_mod)
## [1] 0.8599625
Sadly the MuMIn::r.squaredGLMM approach we tried earlier doesn’t (yet) handle zero inflated versions of the NB in glmmTMB; however, this model is looking like a pretty good fit for our data.
One other thing we might want to consider is correlation structures in our data. In this case we are fitting models to time series, and because the value at any given time point depends on the value at the previous time point, we might need to consider some underlying structure (autocorrelation) in the data we are analysing. We can implement this in glmmTMB:
##fit a zero inflated negative binomial model with autocorrelation structure defined by the time variable
Zi_nb1_ar1_mod <- glmmTMB(abundance ~ scale(standardised_time) * scale(n.threats) + (1|species/population) + ar1(factor(scale(standardised_time)) - 1|species/population),
data = species_single0,
ziformula=~n.threats,
family="nbinom1")
## Warning in fitTMB(TMBStruc): Model convergence problem; non-positive-definite
## Hessian matrix. See vignette('troubleshooting')
## Warning in fitTMB(TMBStruc): Model convergence problem; false convergence (8).
## See vignette('troubleshooting')
Note - we have to include the predictor for ar1() as a factor (that's just the way that glmmTMB needs it to be, don't ask!), hence the factor(scale(standardised_time)).
Trying to fit this throws up a number of warnings/errors. If you run vignette('troubleshooting') you get some very useful help on these warnings. We can solve the main one (false convergence) fairly easily by using a different maximum likelihood optimizer:
##fit a zero inflated negative binomial model with autocorrelation structure defined by the time variable
Zi_nb1_ar1_mod <- glmmTMB(abundance ~ scale(standardised_time) * scale(n.threats) + (1|species/population) + ar1(factor(scale(standardised_time)) - 1|species/population),
data = species_single0,
ziformula=~n.threats,
family="nbinom1",
##the control parameter specifying the optimizer to use:
control=glmmTMBControl(optimizer=optim, optArgs=list(method="BFGS")))
## Warning in fitTMB(TMBStruc): Model convergence problem; non-positive-definite
## Hessian matrix. See vignette('troubleshooting')
which solves one of the issues, but not:
Model convergence problem; non-positive-definite Hessian matrix
The troubleshooting vignette gives the following as probable reasons for this warning:
If we look at the fixed effects of the model:
##show the fixed effects
fixef(Zi_nb1_ar1_mod)
##
## Conditional model:
## (Intercept)
## 4.3060
## scale(standardised_time)
## -0.5596
## scale(n.threats)
## -0.8121
## scale(standardised_time):scale(n.threats)
## -0.5799
##
## Zero-inflation model:
## (Intercept) n.threats
## -13.892 3.581
You can see that the zero inflation parameters (number 3) are not near zero. We also don’t have a dispersion parameter in the model (4) and we aren’t fitting a binomial model (5). Which leaves us with (1) and (2). If we look at the random effect sizes from the zero inflated negative binomial model without autocorrelation:
##look at the model
Zi_nb1_mod
## Formula:
## abundance ~ scale(standardised_time) * scale(n.threats) + (1 |
## species/population)
## Zero inflation: ~n.threats
## Data: species_single0
## AIC BIC logLik df.resid
## 6041.627 6081.199 -3011.813 591
## Random-effects (co)variances:
##
## Conditional model:
## Groups Name Std.Dev.
## population:species (Intercept) 0.707994
## species (Intercept) 0.000331
##
## Number of obs: 600 / Conditional model: population:species, 59; species, 51
##
## Overdispersion parameter for nbinom1 family (): 15.2
##
## Fixed Effects:
##
## Conditional model:
## (Intercept)
## 4.4345
## scale(standardised_time)
## -0.4922
## scale(n.threats)
## -0.7201
## scale(standardised_time):scale(n.threats)
## -0.5032
##
## Zero-inflation model:
## (Intercept) n.threats
## -8.471 1.810
We can see that the random effect of species is quite small, so we could consider removing this from the model with autocorrelation and trying it again:
##zero inflated negative binomial model with autocorrelation and population as random effects
Zi_nb1_ar1_mod <- glmmTMB(abundance ~ scale(standardised_time) * scale(n.threats) + (1|population) + ar1(factor(scale(standardised_time)) - 1|population),
data = species_single0,
ziformula=~n.threats,
family="nbinom1",
##the control parameter specifying the optimizer to use:
control=glmmTMBControl(optimizer=optim, optArgs=list(method="BFGS")))
## Warning in fitTMB(TMBStruc): Model convergence problem; non-positive-definite
## Hessian matrix. See vignette('troubleshooting')
However, you can see we are still getting the same warning, so we will conclude that in this case the model is probably over-parameterised (1). So let’s fall back to our previous best model - Zi_nb1_mod. If we thought there was likely to be significant autocorrelation structure then we would need to think about going back and simplifying our original model.
The DHARMa package allows us to implement some great diagnostic tools to check our model fits (only defined for some distributions at the moment, luckily the zero inflated NB is one of them):
##load DHARMa
library(DHARMa)
## simulate the residuals from the model
##setting the number of sims to 1000 (more is better, according to the DHARMa help file)
Zi_nb1_mod_sim <- simulateResiduals(Zi_nb1_mod, n = 1000)
##plot out the residuals
plot(Zi_nb1_mod_sim)
We can interpret the QQ plot as normal - and it looks good! We have no significant values for any of the tests (which would indicate that the distribution is a poor fit to the data).
On the right we have a plot of the predictions vs residuals. Here we are hoping to see straight solid red lines across the plot which match the dashed red lines, but as you can see there is some deviation from that and we are getting the warning that quantile deviations were detected in the residuals. This isn’t ideal, but visually it doesn’t appear very bad in this case. Ideally we would delve into this some more (see below in More possible models). There is an excellent (and long) vignette for DHARMa available here which you can delve into.
DHARMa also provides a suite of other test functions which allow us to run diagnostics on the simulated outputs from simulateResiduals():
## test to see whether there are outliers; in our case not significant so we don't need to worry
testOutliers(Zi_nb1_mod_sim,
plot = TRUE)
##
## DHARMa bootstrapped outlier test
##
## data: Zi_nb1_mod_sim
## outliers at both margin(s) = 0, observations = 600, p-value = 0.9
## alternative hypothesis: two.sided
## percent confidence interval:
## 0.00000 0.02175
## sample estimates:
## outlier frequency (expected: 0.00338333333333333 )
## 0
## tests if the simulated dispersion is equal to the observed dispersion,
## again not significant so no need to worry
testDispersion(Zi_nb1_mod_sim,
plot = TRUE)
##
## DHARMa nonparametric dispersion test via sd of residuals fitted vs.
## simulated
##
## data: simulationOutput
## ratioObsSim = 0.78004, p-value = 0.39
## alternative hypothesis: two.sided
## compares the distribution of expected zeros in the data against the observed zeros
## this is right on the borderline of being significant, suggesting there might be a better structure for
## our zero inflation parameter (remember we used ~n.threats). That might be worth looking into further
testZeroInflation(Zi_nb1_mod_sim,
plot = TRUE)
##
## DHARMa zero-inflation test via comparison to expected zeros with
## simulation under H0 = fitted model
##
## data: simulationOutput
## ratioObsSim = 0.57281, p-value = 0.05
## alternative hypothesis: two.sided
## see if there is temporal autocorrelation in the residuals
## not significant, so it turns out we didn't need to try and fit the autocorrelation model earlier on!
testTemporalAutocorrelation(Zi_nb1_mod_sim,
plot = TRUE)
## DHARMa::testTemporalAutocorrelation - no time argument provided, using random times for each data point
##
## Durbin-Watson test
##
## data: simulationOutput$scaledResiduals ~ 1
## DW = 2.0987, p-value = 0.2261
## alternative hypothesis: true autocorrelation is not 0
So overall it seems like our model is a pretty good fit to the observed data.
It might be worth briefly considering what the implications of a poorly fitting model are, and how much of an issue it is. Well, this really depends upon the questions you are interested in, and what you want to do with your model.
In our case, we are interested to see if there is a significant effect of the number of threats on how population abundances change through time. We aren’t necessarily interested in the exact values of the coefficients (whether the effect of an additional stressor is -0.10224 vs -0.10890 doesn’t make a big difference to us). We do want to be sure that the patterns we are seeing in the model are reflected in the data, and the fact that they are highly significant and the effect sizes are large means we can be quite confident in this even if the model fit isn’t “perfect”.
However, if we have marginal effects with small effect sizes then we are likely to be more concerned about how well our model fits to be sure whether the effects we are reporting are in fact observed in the data, rather than a function of a poor model fit.
Likewise, if we are fitting a model to data and then using that model to - say - parameterise a simulation, or project the trends beyond the data we have collected then we are again going to be much more concerned about how well our model fits, as any errors in the coefficient estimates are going to be propagated as we use those estimates in other analyses. In this case we might want to fit something more flexible to our data, like a Generalised Additive Model (GAMs) or a Generalised Additive Mixed Model (GAMMs) which can deal well with non-linear effects and thus will make reliable predictions within the bounds of the data you are fitting the model to, but are difficult to interpret when trying to identify the effects of fixed effects and cannot be used to extrapolate beyond the data you have. You can see the flexibility of GAMs below (purple line) vs a linear model (yellow line).
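As a concrete illustration of that flexibility, here is a self-contained toy example with simulated data (the variable names are made up; mgcv ships with R):

```r
library(mgcv)   ## mgcv fits GAMs and is included with R

## simulate a clearly non-linear trend
set.seed(42)
x <- seq(0, 10, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.3)
d <- data.frame(x = x, y = y)

## a straight-line model vs a smooth spline
lin_mod <- lm(y ~ x, data = d)
gam_mod <- gam(y ~ s(x), data = d)

## the GAM tracks the wiggly trend, so it fits far better here
AIC(lin_mod, gam_mod)
```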
In our case, we will continue with the best model we identified above and look at interpreting and plotting its outputs.
As you can see, it has taken a long time to get here, and if your data are very complex, or very non-normal, this process takes even longer. I like to do one final easy check of how well the model is working, which is simply to compare the model predictions to the observed values. We will go back to our old friend the predict() function for this:
## add in the predicted values from the model:
species_single0$predicted <- predict(Zi_nb1_mod,
newdata = species_single0,
type = "response")
##plot the predicted against the observed
ggplot(species_single0, aes(x = abundance,
y = predicted)) +
geom_point(col="grey") +
geom_abline(slope = 1) +
theme_minimal() +
xlab("Observed") +
ylab("Predicted")
The points should then fall close to the 1:1 line, and you can see our model does a pretty good job. It both underpredicts and overpredicts the values in places, but we expect that (we won't ever have a model which fits perfectly; after all, the whole point of a model is to simplify the patterns in the data so we can interpret them), and there aren't any clear patterns (like only underpredicting at high observed values).
Phew! All that effort was worth it!
There are of course a lot more ways to go about visualising models, and Ben Bolker gives a much more detailed deep dive into diagnostic plotting here, which I highly recommend you look through.
We said before that we might want to consider a more complex zero-inflation structure than the one we currently have, and we may also want to think about explicitly modelling a dispersion parameter (although our tests above don't highlight this as a particular issue, it's worth knowing it is possible); e.g. n.threats might be a predictor of increased dispersion in the data. glmmTMB allows us to do this, but we aren't going to dig into that today. However, if we weren't confident about the model fit (low pseudo R-squared, poor predicted vs observed plots, etc.), or wanted to check some other model structures, that would be a good place to start.
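For instance (a sketch only, not run here), glmmTMB's dispformula argument lets the dispersion vary with a predictor, and update() makes the refit a one-liner:

```r
## allow the dispersion parameter to depend on n.threats
Zi_nb1_disp_mod <- update(Zi_nb1_mod,
                          dispformula = ~ n.threats)

## and compare against the original model
AIC(Zi_nb1_mod, Zi_nb1_disp_mod)
```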
So finally, what does our model say?
Task
Print the summary of the Zi_nb1_mod.
##summarise the model
summary(Zi_nb1_mod)
## Family: nbinom1 ( log )
## Formula:
## abundance ~ scale(standardised_time) * scale(n.threats) + (1 |
## species/population)
## Zero inflation: ~n.threats
## Data: species_single0
##
## AIC BIC logLik deviance df.resid
## 6041.6 6081.2 -3011.8 6023.6 591
##
## Random effects:
##
## Conditional model:
## Groups Name Variance Std.Dev.
## population:species (Intercept) 5.013e-01 0.707994
## species (Intercept) 1.095e-07 0.000331
## Number of obs: 600, groups: population:species, 59; species, 51
##
## Overdispersion parameter for nbinom1 family (): 15.2
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.43446 0.09774 45.37 < 2e-16
## scale(standardised_time) -0.49215 0.02231 -22.06 < 2e-16
## scale(n.threats) -0.72013 0.09532 -7.55 4.21e-14
## scale(standardised_time):scale(n.threats) -0.50318 0.02139 -23.52 < 2e-16
##
## (Intercept) ***
## scale(standardised_time) ***
## scale(n.threats) ***
## scale(standardised_time):scale(n.threats) ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Zero-inflation model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.4707 1.3971 -6.063 1.34e-09 ***
## n.threats 1.8103 0.5485 3.300 0.000966 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
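Because the conditional model is on a log link, the coefficients can be back-transformed with exp() to give multiplicative effects (the numbers below are copied from the summary above; note the predictors are scaled, so effects are per standard deviation):

```r
## conditional-model intercept: expected abundance (ignoring the
## zero-inflation part) at mean time and mean number of threats
exp(4.43446)    ## roughly 84 individuals

## main effect of n.threats: each 1-SD increase in the number of
## threats multiplies expected abundance by roughly 0.49
exp(-0.72013)
```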
Well, it says:
There are a host of ways to plot model predictions, but this one is adapted from the glmmTMB paper appendix:
## make a blank data set which includes the variables in the model
## we will then use this to generate predicted values from the model for
## various combinations of number of threats, standardised time, species,
## and population
## we can use the unique() function across the columns in our data.frame
## to retrieve every unique combination of:
##n.threats, standardised_time, species, and population
new_data <- unique(species_single0[,c("n.threats",
"standardised_time",
"species",
"population")])
##scale the relevant columns (remember our model is expecting scaled data)
new_data$n.threats<-scale(new_data$n.threats)
new_data$standardised_time<-scale(new_data$standardised_time)
##generate predictions from the fixed effects only (i.e. with the random effects set to zero)
X_cond <- model.matrix(lme4::nobars(formula(Zi_nb1_mod)[-2]), new_data)
beta_cond <- fixef(Zi_nb1_mod)$cond
pred_cond <- X_cond %*% beta_cond
ziformula <- Zi_nb1_mod$modelInfo$allForm$ziformula
X_zi <- model.matrix(lme4::nobars(ziformula), new_data)
beta_zi <- fixef(Zi_nb1_mod)$zi
pred_zi <- X_zi %*% beta_zi
##transform point estimates of the unconditional counts to the response scale and multiply
##(the conditional model is on the log scale and the zero-inflation model on the logit scale)
pred_ucount = exp(pred_cond)*(1-plogis(pred_zi))
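To see what that line is doing, here is a toy version of the same calculation with made-up numbers: the unconditional mean of a zero-inflated count is the conditional mean times the probability of not being a structural zero.

```r
## conditional mean of 10 on the log scale,
## 20% chance of a structural zero on the logit scale
log_mu  <- log(10)
logit_p <- qlogis(0.2)
exp(log_mu) * (1 - plogis(logit_p))   ## 10 * 0.8 = 8
```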
##load the MASS library
library(MASS)
##set the random number generator seed
set.seed(101)
##use posterior predictive simulations to generate upper and lower confidence intervals
## and median predicted counts,
##ignoring variation in the random effects
##conditional
pred_condpar_psim = mvrnorm(1000,mu=beta_cond,Sigma=vcov(Zi_nb1_mod)$cond)
pred_cond_psim = X_cond %*% t(pred_condpar_psim)
##zero inflation parameter
pred_zipar_psim = mvrnorm(1000,mu=beta_zi,Sigma=vcov(Zi_nb1_mod)$zi)
pred_zi_psim = X_zi %*% t(pred_zipar_psim)
##transform them
pred_ucount_psim = exp(pred_cond_psim)*(1-plogis(pred_zi_psim))
##calculate 95% CIs
ci_ucount = t(apply(pred_ucount_psim,1,quantile,c(0.025,0.975)))
ci_ucount = data.frame(ci_ucount)
##rename
names(ci_ucount) = c("ucount_low","ucount_high")
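If the apply()/quantile() step looks opaque, here is a tiny self-contained version (all numbers made up): three "predictions", each with 1000 posterior draws, give a two-column matrix of row-wise 95% intervals.

```r
set.seed(1)
## rows are predictions (true means 0, 5, 10), columns are draws
sim <- matrix(rnorm(3 * 1000, mean = c(0, 5, 10)), nrow = 3)
ci  <- t(apply(sim, 1, quantile, c(0.025, 0.975)))
ci   ## each row spans roughly its mean - 2 to its mean + 2
```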
##put into a data frame
pred_ucount = data.frame(new_data,
pred_ucount,
ci_ucount)
##we need to reverse the scaling of our predictor variables for our plots to make sense
##the scale() function stores attributes of the scaling in the vectors of scaled data
## try running new_data$n.threats and looking at the bottom values
##write a function to do this:
unscale <- function(x){
x * attr(x, 'scaled:scale') + attr(x, 'scaled:center')
}
##unscale the variables
pred_ucount$n.threats_unscaled <- unscale(pred_ucount$n.threats)
pred_ucount$standardised_time_unscaled <- unscale(pred_ucount$standardised_time)
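It's worth convincing yourself that unscale() really does invert scale(); a quick check (the function definition is repeated so this chunk runs on its own):

```r
unscale <- function(x){
  x * attr(x, 'scaled:scale') + attr(x, 'scaled:center')
}
v <- c(3, 7, 11, 42)
## scaling then unscaling recovers the original values
all.equal(as.numeric(unscale(scale(v))), v)   ## TRUE
```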
##load the viridis package (colourblind friendly palettes)
library(viridis)
##plot out the predicted median values for abundance
## in response to time (x-axis)
##and grouped by the number of threats
ggplot(pred_ucount, aes(x = standardised_time_unscaled,
y = pred_ucount,
group = n.threats_unscaled,
col = n.threats_unscaled))+
##median lines for each number of threats
geom_line() +
##add in a geom_ribbon to show the 95% CI
geom_ribbon(aes(ymin = ucount_low,
ymax = ucount_high),
alpha = 0.1,
col = "grey",
linetype=0) +
##minimal theme
theme_minimal() +
##set x and y axes labels
ylab("Predicted\nabundance") + xlab("Time\n(weeks)") +
##viridis colour palette for continuous data
scale_colour_viridis_c() +
##move legend to the top
theme(legend.position = "top") +
##rename the legend
labs(colour="Number of threats")
Brilliant! We have a clear visualisation of the effects of multiple threats on the dynamics of populations through time, and we can say with some certainty that the more stressors a population faces, the faster it is likely to decline.
Let’s quickly summarise what we have learned today:
Phew! That's a lot to cover in one week. Don't worry if you didn't manage to work through everything or found some parts difficult - you can always come back to this and refer to it at any time.
| Function/operator | Call |
|---|---|
| Apply smoothing visualisation to ggplot() | geom_smooth() |
| Calculate difference between dates | difftime() |
| Run a GLM model | glm() |
| Return predicted values from a model | predict() |
| Return the residuals of a glm | resid() |
| Make a QQ plot | qqnorm() |
| Add a QQ line to a QQ plot | qqline() |
| Compare models using AIC | AIC() |
| Summarise a model | summary() |
| Order the items in a column | order() |
| Fit a GLMM with glmmTMB | glmmTMB() |
| Create scaled residuals from a glmm | simulateResiduals() |
| Apply a function to grouped data in tidyverse | group_map() |
| Fit a distribution to observations | fitdist() |
| Generate summary statistics from a fitted distribution | gofstat() |
| Calculate pseudo R-squared for glmms | r.squaredGLMM() |
| Update a model with a new argument | update() |
| Compare models with AIC and give more info | anova() |
| Test for significant outliers on simulateResiduals() output | testOutliers() |
| Test for significant dispersion on simulateResiduals() | testDispersion() |
| Test for significant zero inflation on simulateResiduals() | testZeroInflation() |
| Test for significant temporal autocorrelation in the residuals from simulateResiduals() | testTemporalAutocorrelation() |
| Return fixed effects of glmm | fixef() |
| Compute exponential | exp() |
| Plot a ribbon geom | geom_ribbon() |
| Apply a continuous viridis colour scheme | scale_colour_viridis_c() |
A brief introduction to mixed effects modelling and multi-model inference in ecology
Hierarchical generalized additive models in ecology: an introduction with mgcv
Ben Bolker’s fantastic walk through of fitting GLMMs and problem solving them
So this week’s homework will build on our delve into fitting GLMs to data. In your cloned Bioinformatics_data repository in the Workshop 5 folder you will find four different data sets. You will need to analyse these and for each identify:
You can do these in a GLM framework as there aren’t any random effects in the data.
Answers to be revealed at the start of next week!